Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
The necessity of using a fixed-size word vocabulary to control model
complexity in state-of-the-art neural machine translation (NMT) systems
is an important bottleneck on performance, especially for morphologically rich
languages. Conventional methods that aim to overcome this problem using
sub-word or character-level representations rely solely on statistics and
disregard the linguistic properties of words, which disrupts
word structure and causes semantic and syntactic losses. In this paper, we
propose a new vocabulary reduction method for NMT, which can reduce the
vocabulary of a given input corpus to any desired rate while also considering the
morphological properties of the language. Our method is based on unsupervised
morphology learning and can be, in principle, used for pre-processing any
language pair. We also present an alternative word segmentation method based on
supervised morphological analysis, which aids us in measuring the accuracy of
our model. We evaluate our method on the Turkish-to-English NMT task, where the
input language is morphologically rich and agglutinative. We analyze different
representation methods in terms of translation accuracy as well as the semantic
and syntactic properties of the generated output. Our method obtains a
significant improvement of 2.3 BLEU points over the conventional vocabulary
reduction technique, showing that it can provide better accuracy in open
vocabulary translation of morphologically rich languages.
Comment: The 20th Annual Conference of the European Association for Machine Translation (EAMT), Research Paper, 12 pages
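The conventional vocabulary reduction technique this abstract compares against is statistical subword segmentation in the style of byte-pair encoding (BPE). As a point of contrast with the morphology-aware method proposed above, here is a minimal sketch of the purely statistical merge procedure; the toy Turkish corpus (ev 'house', -ler plural, -de locative) and the number of merges are illustrative assumptions, not the paper's setup.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every free-standing occurrence of the pair into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: ev+ler ('houses'), ev+de ('in the house'), ev+ler+de,
# written as space-separated characters with an end-of-word marker.
vocab = {'e v l e r </w>': 5, 'e v d e </w>': 4, 'e v l e r d e </w>': 3}

for _ in range(6):  # the number of merges is the method's only knob
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(*best, '->', ''.join(best))

# Merges follow frequency alone, so the fused unit 'evl' crosses the
# ev+ler morpheme boundary; a morphology-aware reducer keeps the split
# aligned with the suffixes -ler and -de instead.
```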
Logographic Information Aids Learning Better Representations for Natural Language Inference
Statistical language models conventionally implement representation learning
based on the contextual distribution of words or other formal units, whereas
any information related to the logographic features of written text is often
ignored, under the assumption that it can be retrieved from co-occurrence
statistics. On the other hand, as language models become larger and require
more data to learn reliable representations, such assumptions may start to
break down, especially under conditions of data sparsity. Many languages, including
Chinese and Vietnamese, use logographic writing systems where surface forms are
represented as a visual organization of smaller graphemic units, which often
contain many semantic cues. In this paper, we present a novel study which
explores the benefits of providing language models with logographic information
in learning better semantic representations. We test our hypothesis in the
natural language inference (NLI) task by evaluating the benefit of computing
multi-modal representations that combine contextual information with glyph
information. Our evaluation results in six languages with different typology
and writing systems suggest significant benefits of using multi-modal
embeddings in languages with logographic systems, especially for words with
sparse occurrence statistics.
Comment: Accepted to Findings of AACL
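As a reading aid, the sketch below shows one plausible way to build the multi-modal embeddings the abstract describes: a small CNN over rendered glyph bitmaps whose output is concatenated with the ordinary token embedding and projected back to the model dimension. The rendering pipeline, layer sizes, and fusion by concatenation are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GlyphAwareEmbedding(nn.Module):
    """Fuses a standard token embedding with features from glyph images.

    Sketch: glyph bitmaps (e.g., rendered Hanzi) pass through a small
    CNN; the result is concatenated with the token embedding and
    projected back to the model dimension.
    """
    def __init__(self, vocab_size, d_model=256, glyph_size=24):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.glyph_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32)
        )
        self.proj = nn.Linear(d_model + 32, d_model)

    def forward(self, token_ids, glyph_images):
        # token_ids: (batch, seq); glyph_images: (batch, seq, 1, H, W)
        b, s = token_ids.shape
        tok = self.token_emb(token_ids)                   # (b, s, d)
        gly = self.glyph_cnn(glyph_images.flatten(0, 1))  # (b*s, 32)
        gly = gly.view(b, s, -1)
        return self.proj(torch.cat([tok, gly], dim=-1))   # (b, s, d)

# Toy usage with random "bitmaps" standing in for rendered glyphs.
emb = GlyphAwareEmbedding(vocab_size=1000)
out = emb(torch.randint(0, 1000, (2, 7)), torch.rand(2, 7, 1, 24, 24))
print(out.shape)  # torch.Size([2, 7, 256])
```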
Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, known as Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in stimulating the use of the visual modality, and we propose methods that highlight the importance of visual signals in the datasets and demonstrably improve the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets, and that careful consideration must be given to the creation of future MMT datasets, for which we also provide useful insights.
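The sanity check the abstract describes can be summarized as: decode each source sentence once with its true image and once with an unrelated (shuffled) image, then compare corpus-level scores. Below is a hedged sketch of that protocol; the `model.translate(src, img)` interface and the `EchoModel` stub are hypothetical, while `sacrebleu.corpus_bleu` is the real library call.

```python
import random
import sacrebleu  # pip install sacrebleu

def sanity_check(model, sources, images, references, seed=0):
    """Compare translation quality with true vs. shuffled image features.

    A model that truly exploits the visual modality should degrade when
    each sentence is paired with an unrelated image.
    """
    congruent = [model.translate(s, v) for s, v in zip(sources, images)]
    shuffled = images[:]
    random.Random(seed).shuffle(shuffled)  # incongruent pairing
    incongruent = [model.translate(s, v) for s, v in zip(sources, shuffled)]
    bleu_c = sacrebleu.corpus_bleu(congruent, [references]).score
    bleu_i = sacrebleu.corpus_bleu(incongruent, [references]).score
    # A gap near zero suggests the images are being ignored.
    return bleu_c - bleu_i

class EchoModel:  # trivial stand-in so the sketch runs end-to-end
    def translate(self, src, img):
        return src  # a real MMT model would condition on img

print(sanity_check(EchoModel(), ["ein hund läuft"], [None],
                   ["a dog runs"]))  # 0.0 for this image-blind stub
```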
A Latent Morphology Model for Open-Vocabulary Neural Machine Translation
Translation into morphologically-rich languages challenges neural machine
translation (NMT) models with extremely sparse vocabularies where atomic
treatment of surface forms is unrealistic. This problem is typically addressed
by either pre-processing words into subword units or performing translation
directly at the level of characters. The former is based on word segmentation
algorithms optimized using corpus-level statistics with no regard to the
translation task. The latter learns directly from translation data but requires
rather deep architectures. In this paper, we propose to translate words by
modeling word formation through a hierarchical latent variable model which
mimics the process of morphological inflection. Our model generates words one
character at a time by composing two latent representations: a continuous one,
aimed at capturing the lexical semantics, and a set of (approximately) discrete
features, aimed at capturing the morphosyntactic function, which are shared
among different surface forms. Our model achieves better accuracy in
translation into three morphologically-rich languages than conventional
open-vocabulary NMT methods, while also demonstrating a better generalization
capacity under low- to mid-resource settings.
Comment: Published at ICLR 2020
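A minimal sketch of the two-part latent composition described above: a continuous lemma vector and a set of approximately discrete morphological features jointly initialize a character-level decoder. Here the discrete features are relaxed with Gumbel-softmax, a standard device for approximately discrete latent variables; this choice, like all layer sizes, is an assumption rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWordGenerator(nn.Module):
    """Generates a word's characters from two latent codes.

    A continuous vector z_lemma (lexical semantics) and approximately
    discrete features z_morph (morphosyntactic function), relaxed here
    with Gumbel-softmax, jointly initialize a character-level GRU.
    """
    def __init__(self, char_vocab, d_lemma=128, n_feats=8, n_classes=4):
        super().__init__()
        self.feat_logits = nn.Linear(d_lemma, n_feats * n_classes)
        self.n_feats, self.n_classes = n_feats, n_classes
        d_state = d_lemma + n_feats * n_classes
        self.char_emb = nn.Embedding(char_vocab, 64)
        self.gru = nn.GRU(64, d_state, batch_first=True)
        self.out = nn.Linear(d_state, char_vocab)

    def forward(self, z_lemma, char_inputs, tau=0.5):
        logits = self.feat_logits(z_lemma).view(-1, self.n_feats, self.n_classes)
        z_morph = F.gumbel_softmax(logits, tau=tau)  # ~discrete features
        h0 = torch.cat([z_lemma, z_morph.flatten(1)], dim=-1).unsqueeze(0)
        hidden, _ = self.gru(self.char_emb(char_inputs), h0)
        return self.out(hidden)  # per-step character logits

gen = LatentWordGenerator(char_vocab=60)
logits = gen(torch.randn(2, 128), torch.randint(0, 60, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 60])
```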
ANTONIO GONZÁLEZ COLLECTION. OFFICIAL CHRONICLER OF TELDE [Graphic material]
Digital copy. Madrid: Ministerio de Educación, Cultura y Deporte. Subdirección General de Coordinación Bibliotecaria, 201
On the Importance of Word Boundaries in Character-level Neural Machine Translation
Neural Machine Translation (NMT) models generally perform translation using a fixed-size lexical vocabulary, which is an important bottleneck on their generalization capability and overall translation quality. The standard approach to overcoming this limitation is to segment words into subword units, typically using external tools with arbitrary heuristics, resulting in vocabulary units that are not optimized for the translation task. Recent studies have shown that the same approach can be extended to perform NMT directly at the level of characters, which can deliver translation accuracy on par with subword-based models; on the other hand, this requires relatively deeper networks. In this paper, we propose a more computationally efficient solution for character-level NMT which implements a hierarchical decoding architecture, where translations are generated first at the level of words and then at the level of characters. We evaluate different methods for open-vocabulary NMT on the machine translation task from English into five languages with distinct morphological typology, and show that the hierarchical decoding model can reach higher translation accuracy than the subword-level NMT model using significantly fewer parameters, while demonstrating a better capacity for learning longer-distance contextual and grammatical dependencies than the standard character-level NMT model.
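A sketch of what a hierarchical decoder of this kind can look like: an outer word-level RNN produces one hidden state per target word, and an inner character-level RNN spells out each word from that state, feeding a summary back to the word level. The greedy loop, cell types, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Two-tier decoding sketch: word-level states drive a char speller."""
    def __init__(self, char_vocab, d=128, max_chars=16):
        super().__init__()
        self.word_rnn = nn.GRUCell(d, d)      # outer: one step per word
        self.char_rnn = nn.GRUCell(64, d)     # inner: one step per character
        self.char_emb = nn.Embedding(char_vocab, 64)
        self.char_out = nn.Linear(d, char_vocab)
        self.word_feedback = nn.Linear(d, d)  # feeds spelled word back
        self.max_chars = max_chars

    def spell_word(self, word_state, bos_id=1, eow_id=2):
        """Greedily emit characters until the end-of-word symbol."""
        chars, h = [], word_state
        prev = torch.tensor([bos_id])
        for _ in range(self.max_chars):
            h = self.char_rnn(self.char_emb(prev), h)
            prev = self.char_out(h).argmax(dim=-1)
            if prev.item() == eow_id:
                break
            chars.append(prev.item())
        return chars, h

    def forward(self, context, n_words=5):
        """context: (1, d) encoder summary; returns spelled words."""
        words, h_word = [], context
        inp = torch.zeros_like(context)
        for _ in range(n_words):
            h_word = self.word_rnn(inp, h_word)
            chars, h_char = self.spell_word(h_word)
            words.append(chars)
            inp = self.word_feedback(h_char)  # summary of spelled word
        return words

dec = HierarchicalDecoder(char_vocab=40)
print(dec(torch.randn(1, 128)))  # one list of character ids per word
```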
Compositional Source Word Representations for Neural Machine Translation
The requirement for neural machine translation (NMT) models to use fixed-size input and output vocabularies plays an important role in their accuracy and generalization capability. The conventional approach to coping with this limitation is to perform translation based on a vocabulary of sub-word units that are predicted using statistical word segmentation methods. However, these methods have recently been shown to be prone to morphological errors, which lead to inaccurate translations. In this paper, we extend the source-language embedding layer of the NMT model with a bi-directional recurrent neural network that generates compositional representations of the source words from embeddings of character n-grams. Our model consistently outperforms conventional NMT with sub-word units on four translation directions with varying degrees of morphological complexity and data sparseness on the source side.
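The compositional embedding layer the abstract describes can be pictured as follows: look up embeddings for a word's character n-grams, run a bi-directional RNN over them, and combine the final states into a single source-word vector. The sketch below assumes trigrams, a GRU, and specific dimensions; these details are illustrative.

```python
import torch
import torch.nn as nn

def char_ngrams(word, n=3):
    """Character trigrams with boundary markers: 'cat' -> ^ca, cat, at$."""
    padded = f'^{word}$'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CompositionalWordEmbedding(nn.Module):
    """Builds a word vector from its character n-grams with a BiGRU."""
    def __init__(self, ngram_vocab, d=128):
        super().__init__()
        self.ngram_emb = nn.Embedding(len(ngram_vocab), d)
        self.birnn = nn.GRU(d, d // 2, bidirectional=True, batch_first=True)
        self.vocab = {g: i for i, g in enumerate(ngram_vocab)}

    def forward(self, word):
        ids = torch.tensor([[self.vocab[g] for g in char_ngrams(word)]])
        _, h = self.birnn(self.ngram_emb(ids))  # h: (2, 1, d/2)
        return torch.cat([h[0], h[1]], dim=-1)  # (1, d) word vector

# Related surface forms share n-grams, so their vectors share parameters.
ngrams = sorted({g for w in ('evler', 'evlerde', 'evde') for g in char_ngrams(w)})
embed = CompositionalWordEmbedding(ngrams)
print(embed('evlerde').shape)  # torch.Size([1, 128])
```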
FBK’s Neural Machine Translation Systems for IWSLT 2016
In this paper, we describe FBK’s neural machine translation (NMT) systems submitted to the International Workshop on Spoken Language Translation (IWSLT) 2016. The systems are based on the state-of-the-art NMT architecture that is equipped with a bi-directional encoder and an attention mechanism in the decoder. They leverage linguistic information such as lemmas and part-of-speech tags of the source words in the form of additional factors along with the words. We compare the performance of word- and subword-level NMT systems, along with different optimizers. Further, we explore different ensemble techniques to leverage multiple models within the same network and across different networks. Several reranking methods are also explored. Our submissions cover all directions of the MSLT task, as well as the en-{de, fr} and {de, fr}-en directions of TED. Compared to previously published best results on the TED 2014 test set, our models achieve comparable results on en-de and surpass them on en-fr (+2 BLEU) and fr-en (+7.7 BLEU) language pairs.
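Of the ensemble techniques mentioned, the simplest is to average the next-token distributions of several models at each decoding step. The sketch below shows that idea with greedy search; the `step(src, prefix)` model interface and the `RandomModel` stand-in are hypothetical.

```python
import torch

def ensemble_greedy_decode(models, src, bos_id=1, eos_id=2, max_len=50):
    """Greedy decoding that averages next-token probabilities over models.

    Each model is assumed to expose `step(src, prefix) -> logits` over
    the target vocabulary; this interface is for illustration only.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        probs = torch.stack([m.step(src, prefix).softmax(-1) for m in models])
        next_id = probs.mean(dim=0).argmax().item()  # uniform model average
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix[1:]

class RandomModel:  # stand-in scorer so the sketch runs
    def __init__(self, seed, vocab=20):
        self.gen = torch.Generator().manual_seed(seed)
        self.vocab = vocab
    def step(self, src, prefix):
        return torch.rand(self.vocab, generator=self.gen)

print(ensemble_greedy_decode([RandomModel(0), RandomModel(1)], src="dummy"))
```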
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, the evaluation of these methods in various languages has been constrained by the lack of high-quality parallel corpora as well as of engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on the downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our careful analysis of evaluation criteria for MT models in Turkic languages also points to the necessity of further research in this direction. We release the corpus splits, test sets, as well as models to the public.
Peer reviewed
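For the automatic-metric side of such an evaluation, a common recipe is to score each translation direction's hypotheses against its references with sacrebleu. The sketch below assumes BLEU and chrF as the reported metrics and uses tiny inline examples in place of real corpus files.

```python
import sacrebleu  # pip install sacrebleu

def evaluate_pair(hyp_lines, ref_lines):
    """Corpus BLEU and chrF for one translation direction."""
    bleu = sacrebleu.corpus_bleu(hyp_lines, [ref_lines]).score
    chrf = sacrebleu.corpus_chrf(hyp_lines, [ref_lines]).score
    return bleu, chrf

# Illustrative: score every direction's hypotheses against references.
hyps = {'uz-tr': ['bu bir sinov .'], 'kk-en': ['this is a test .']}
refs = {'uz-tr': ['bu bir sinov .'], 'kk-en': ['this is a test .']}
for pair in hyps:
    bleu, chrf = evaluate_pair(hyps[pair], refs[pair])
    print(f'{pair}: BLEU={bleu:.1f} chrF={chrf:.1f}')
```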